$S^3$ - Statistical Sandhi Splitting
Abstract
The problem of Saṃdhi-Splitting is central to the computational processing of Sanskrit texts. The best-known current algorithm for this task takes a chunk, generates all possible splits, and chooses the Maximum-a-Posteriori estimate as the final answer. Our contributions to Saṃdhi-Splitting are two-fold. First, we improve upon the current algorithm by proposing a principled modification of the posterior probability function that achieves better results. Second, we propose an algorithm based on Bayesian Word-Segmentation methods. We find that the unsupervised version of our algorithm achieves better precision than the current algorithm with the original probabilistic model. We then present a supervised version of our algorithm that outperforms all previous methods and models.
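The generate-and-rank idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy lexicon and its probabilities, the single reverse sandhi rule (ā from a + a), the unigram scoring, and the function names `candidate_splits` and `map_split`. The paper's actual posterior probability function is not given in this abstract.

```python
import math

# Hypothetical toy lexicon: entries and unigram probabilities are invented.
LEXICON = {"na": 0.4, "asti": 0.3, "as": 0.2, "ti": 0.1}

# Hypothetical reverse sandhi rules: a surface character maps to the
# (end of left word, start of right word) pairs it may have come from.
# "\u0101" is the precomposed character "ā" (a + a -> ā).
REVERSE_RULES = {"\u0101": [("a", "a")]}


def candidate_splits(chunk):
    """Enumerate every way to split `chunk`, with or without undoing a rule."""
    results = [[chunk]]  # the unsplit chunk is itself a candidate
    for i in range(1, len(chunk)):
        # Plain boundary: cut between characters without any sound change.
        for rest in candidate_splits(chunk[i:]):
            results.append([chunk[:i]] + rest)
        # Sandhi boundary: undo a rule at position i, if one applies.
        for left_end, right_start in REVERSE_RULES.get(chunk[i], []):
            left = chunk[:i] + left_end
            right = right_start + chunk[i + 1:]
            for rest in candidate_splits(right):
                results.append([left] + rest)
    return results


def map_split(chunk):
    """Return the highest-scoring split whose words are all in the lexicon.

    The score is a sum of log unigram probabilities, a simplifying
    assumption that treats every rule application as equally likely.
    """
    best, best_score = None, -math.inf
    for words in candidate_splits(chunk):
        if all(w in LEXICON for w in words):
            score = sum(math.log(LEXICON[w]) for w in words)
            if score > best_score:
                best, best_score = words, score
    return best


if __name__ == "__main__":
    print(map_split("n\u0101sti"))
```

The enumeration is exponential in the chunk length, which is acceptable only for a toy input; a practical system would need pruning or dynamic programming.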
Similar resources
Statistical Sandhi Splitter for Agglutinative Languages
Sandhi splitting is a primary and an important step for any natural language processing (NLP) application for languages which have agglutinative morphology. This paper presents a statistical approach to build a sandhi splitter for agglutinative languages. The input to the model is a valid string in the language and the output is a split of that string into meaningful word/s. The approach adopte...
Sanskrit Sandhi Splitting using $\pmb{seq2(seq)^2}$
In Sanskrit, small words (morphemes) are combined through a morphophonological process called Sandhi to form compound words. Sandhi splitting is the process of splitting a given compound word into its constituent morphemes. Although rules governing the splitting of words exist, it is highly challenging to identify the location of the splits in a compound word, as the same compound word might be...
A Sandhi Splitter for Malayalam
Sandhi splitting is the primary task for computational processing of text in Sanskrit and Dravidian languages. In these languages, words can join together with morpho-phonemic changes at the point of joining. This phenomenon is known as Sandhi. Sandhi splitter splits the string of conjoined words into individual words. Accurate execution of sandhi splitting is crucial for text processing tasks ...
External Sandhi and its Relevance to Syntactic Treebanking
External sandhi is a linguistic phenomenon which refers to a set of sound changes that occur at word boundaries. These changes are similar to phonological processes such as assimilation and fusion when they apply at the level of prosody, such as in connected speech. External sandhi formation can be orthographically reflected in some languages. External sandhi formation in such languages, causes...
Statistical Modeling of Mandarin Tone Sandhi: Neutralization of Underlying Pitch Targets
This study statistically models the surface f0 contour and the underlying pitch target of a well-studied third sandhi tone of Mandarin Chinese. Although the growth curve analysis on the surface f0 contours indicates non-neutralization of this sandhi tone (T3) and the base T2, their underlying pitch targets do show neutralization. These results in Mandarin are also consistent with the perception...
Publication date: 2011